Introduction

This analysis will focus on building a regression model to predict house sale prices using Kaggle data. The process involves identifying key predictors, interpreting their impact, and validating the model.

Model Creation

library(tidyverse)
library(dplyr)
library(pander)

train = read.csv("train.csv", stringsAsFactors = TRUE)

The first thing I did was look at the pairs plot to see all notable patterns that had good slop.

This is what I found:

Slopes that look like they could be useful:

  • GarageArea
  • KitchenQual
  • TotRmsAbvGrd
  • GrLivArea
  • X2ndFlrSF
  • X1stFlrSF
  • TotalBsmtSF
  • MasVnrType
  • OverallQual

Smaller but potentially useful slops:

  • SaleCondition
  • MiscFeature
  • EnclosedPorch
  • PavedDrive
  • GarageYrBlt
  • GarageFinish
  • FullBath
  • BsmtUnfSF
  • BsmtQual
  • ExterQual


After that I looked at the data dictionary and looked for stuff that seemed like it might be useful logicly.

Here is what I found:

This could help with location

  • MSZoning

Both could have a lot of variables wrapped into one

  • OverallQual

  • OverallCond

This could show economic things because of timing

  • YearBuilt

  • YearRemodAdd

  • YrSold

  • MoSold

This might be able to show how having a basement/finished basement contributes

  • BsmtFinSF1

  • BsmtUnfSF

The number of baths total could be useful

  • BsmtFullBath

  • BsmtHalfBath

  • FullBath

  • HalfBath


I then used all of that new knowledge I found and starting making linear models. These models had various success rates. Bellow you can see a sample of some of the models I tried.

mylm1 = lm(SalePrice ~ GarageArea + KitchenQual + TotRmsAbvGrd + GrLivArea + OverallQual, data = train)
pander(summary(mylm1))
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 9356 9630 0.9716 0.3314
GarageArea 63.35 5.901 10.74 6.338e-26
KitchenQualFa -69519 8249 -8.428 8.372e-17
KitchenQualGd -53291 4413 -12.08 4.637e-32
KitchenQualTA -68226 5046 -13.52 2.526e-39
TotRmsAbvGrd -1757 1110 -1.583 0.1136
GrLivArea 52.89 3.899 13.57 1.47e-39
OverallQual 21394 1146 18.68 6.296e-70
Fitting linear model: SalePrice ~ GarageArea + KitchenQual + TotRmsAbvGrd + GrLivArea + OverallQual
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
1460 38268 0.7691 0.768
mylm2 = lm(SalePrice ~ SaleCondition + X2ndFlrSF + X1stFlrSF + TotalBsmtSF, data = train)
pander(summary(mylm2))
  Estimate Std. Error t value Pr(>|t|)
(Intercept) -27636 6172 -4.478 8.143e-06
SaleConditionAdjLand 6546 24018 0.2725 0.7852
SaleConditionAlloca 7555 14605 0.5173 0.605
SaleConditionFamily -9272 11519 -0.8049 0.421
SaleConditionNormal 21874 4877 4.485 7.865e-06
SaleConditionPartial 69847 6432 10.86 1.829e-26
X2ndFlrSF 82.07 2.889 28.41 1.558e-141
X1stFlrSF 82.13 5.689 14.44 3.08e-44
TotalBsmtSF 57.42 5.041 11.39 7.594e-29
Fitting linear model: SalePrice ~ SaleCondition + X2ndFlrSF + X1stFlrSF + TotalBsmtSF
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
1460 47039 0.6513 0.6494
mylm3 = lm(SalePrice ~ OverallQual + OverallCond + YearBuilt + YearRemodAdd + YrSold + MoSold, data = train)
pander(summary(mylm3))
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 499204 1924199 0.2594 0.7953
OverallQual 40654 1175 34.61 6.353e-192
OverallCond 985.9 1330 0.7415 0.4585
YearBuilt 217.5 64.37 3.379 0.0007474
YearRemodAdd 261.5 85.98 3.041 0.0024
YrSold -756 958.9 -0.7885 0.4306
MoSold -234 471.2 -0.4967 0.6195
Fitting linear model: SalePrice ~ OverallQual + OverallCond + YearBuilt + YearRemodAdd + YrSold + MoSold
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
1460 48001 0.6364 0.6349
mylm4 = lm(SalePrice ~ BsmtFullBath + BsmtHalfBath + FullBath + HalfBath, data = train)
pander(summary(mylm4))
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 23972 5092 4.708 2.739e-06
BsmtFullBath 42546 3063 13.89 2.813e-41
BsmtHalfBath 18980 6652 2.853 0.004389
FullBath 79592 2883 27.61 2.954e-135
HalfBath 34458 3147 10.95 7.276e-27
Fitting linear model: SalePrice ~ BsmtFullBath + BsmtHalfBath + FullBath + HalfBath
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
1460 59871 0.4336 0.432
mylm5 = lm(SalePrice ~ OverallQual:KitchenQual:OverallCond, data = train)
pander(summary(mylm5))
Table continues below
  Estimate Std. Error t value
(Intercept) 77804 5902 13.18
OverallQual:KitchenQualEx:OverallCond 5712 178.2 32.05
OverallQual:KitchenQualFa:OverallCond 1259 359 3.506
OverallQual:KitchenQualGd:OverallCond 3581 162.8 21.99
OverallQual:KitchenQualTA:OverallCond 2068 193 10.71
  Pr(>|t|)
(Intercept) 1.421e-37
OverallQual:KitchenQualEx:OverallCond 5.574e-171
OverallQual:KitchenQualFa:OverallCond 0.0004687
OverallQual:KitchenQualGd:OverallCond 9.659e-93
OverallQual:KitchenQualTA:OverallCond 7.801e-26
Fitting linear model: SalePrice ~ OverallQual:KitchenQual:OverallCond
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
1460 53992 0.5394 0.5381

After trying those, I finally thought to use the variables we did in class, as well as combining columns to perhaps make my own new columns.

You can see below what I did.

train <- train %>%
  mutate(TotalSF = X1stFlrSF + X2ndFlrSF + TotalBsmtSF)

train <- train %>%
  mutate(Totalbaths = BsmtFullBath + BsmtHalfBath + FullBath + HalfBath)

train <- train %>%
  mutate(renovatedyn = ifelse(YearRemodAdd == YearBuilt, "No", "Yes"))

train <- train %>%
  mutate(age_at_sale = YrSold - YearBuilt)

train <- train %>%
  mutate(RichNbrhd = case_when(Neighborhood %in% c("StoneBr", "NridgHt", "NoRidge") ~ 1,
                             TRUE ~ 0))
train <- train %>%
  mutate(OverallQual_group = ifelse(OverallQual <= 5, 5, OverallQual))
mylm6 = lm(SalePrice ~ TotalSF, data = train)
pander(summary(mylm6))
  Estimate Std. Error t value Pr(>|t|)
(Intercept) -13220 4251 -3.11 0.001909
TotalSF 75.63 1.577 47.95 5.511e-302
Fitting linear model: SalePrice ~ TotalSF
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
1460 49506 0.6119 0.6117
mylm7 = lm(SalePrice ~ TotalSF + Totalbaths + MSZoning + renovatedyn + age_at_sale, data = train)
pander(summary(mylm7))
  Estimate Std. Error t value Pr(>|t|)
(Intercept) -5369 14760 -0.3638 0.7161
TotalSF 58.5 1.727 33.87 8.489e-186
Totalbaths 10425 1650 6.319 3.505e-10
MSZoningFV 34099 15104 2.258 0.02412
MSZoningRH 15329 17441 0.8789 0.3796
MSZoningRL 29896 13917 2.148 0.03186
MSZoningRM 24092 13999 1.721 0.08546
renovatedynYes 18960 2534 7.483 1.25e-13
age_at_sale -742.2 50.87 -14.59 4.316e-45
Fitting linear model: SalePrice ~ TotalSF + Totalbaths + MSZoning + renovatedyn + age_at_sale
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
1460 43188 0.7061 0.7045
mylm8 = lm(SalePrice ~ TotalSF + OverallQual + Totalbaths + MSZoning + renovatedyn + YearBuilt + age_at_sale, data = train)
pander(summary(mylm8))
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 292076 1525175 0.1915 0.8482
TotalSF 38.58 1.835 21.03 6.996e-86
OverallQual 22794 1150 19.82 1.596e-77
Totalbaths 9224 1468 6.284 4.347e-10
MSZoningFV 6077 13482 0.4507 0.6522
MSZoningRH 1587 15507 0.1023 0.9185
MSZoningRL 14203 12380 1.147 0.2515
MSZoningRM 590 12488 0.04725 0.9623
renovatedynYes 13229 2267 5.834 6.654e-09
YearBuilt -188.5 759.6 -0.2481 0.8041
age_at_sale -491.5 757.5 -0.6488 0.5165
Fitting linear model: SalePrice ~ TotalSF + OverallQual + Totalbaths + MSZoning + renovatedyn + YearBuilt + age_at_sale
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
1460 38334 0.7687 0.7672

I then looked over all my models and choice the following model.

Formula:

\[ \begin{aligned} SalePrice = & \ \beta_0 + \beta_1 (\text{RichNbrhd}) + \beta_2 (\text{TotalSF}) + \beta_3 (\text{OverallQual\_group}) \\ & + \beta_4 (\text{RichNbrhd} \times \text{TotalSF}) + \beta_5 (\text{RichNbrhd} \times \text{OverallQual\_group}) \\ & + \beta_6 (\text{TotalSF} \times \text{OverallQual\_group}) + \beta_7 (\text{RichNbrhd} \times \text{TotalSF} \times \text{OverallQual\_group}) + \epsilon \end{aligned} \]

Model Summary:

mylm_80 = lm(SalePrice ~ RichNbrhd*TotalSF*as.factor(OverallQual_group) + YearBuilt, data = train) 
pander(summary(mylm_80))
Table continues below
  Estimate Std. Error
(Intercept) -694411 61187
RichNbrhd -568990 47159
TotalSF 33.76 2.194
as.factor(OverallQual_group)6 -13805 8351
as.factor(OverallQual_group)7 -22768 10043
as.factor(OverallQual_group)8 -26973 17308
as.factor(OverallQual_group)9 -48468 112525
as.factor(OverallQual_group)10 418510 28134
YearBuilt 384.2 31.3
RichNbrhd:TotalSF 148.3 8.66
RichNbrhd:as.factor(OverallQual_group)6 733242 974427
RichNbrhd:as.factor(OverallQual_group)7 487985 60834
RichNbrhd:as.factor(OverallQual_group)8 570124 53513
RichNbrhd:as.factor(OverallQual_group)9 397519 128680
TotalSF:as.factor(OverallQual_group)6 12.83 3.509
TotalSF:as.factor(OverallQual_group)7 23.17 3.782
TotalSF:as.factor(OverallQual_group)8 32.17 5.52
TotalSF:as.factor(OverallQual_group)9 50.69 31.94
TotalSF:as.factor(OverallQual_group)10 -64.14 4.555
RichNbrhd:TotalSF:as.factor(OverallQual_group)6 -206.5 364.8
RichNbrhd:TotalSF:as.factor(OverallQual_group)7 -118.2 15.29
RichNbrhd:TotalSF:as.factor(OverallQual_group)8 -143.5 11.32
RichNbrhd:TotalSF:as.factor(OverallQual_group)9 -98.19 34.61
  t value Pr(>|t|)
(Intercept) -11.35 1.195e-28
RichNbrhd -12.07 5.353e-32
TotalSF 15.39 1.436e-49
as.factor(OverallQual_group)6 -1.653 0.09853
as.factor(OverallQual_group)7 -2.267 0.02354
as.factor(OverallQual_group)8 -1.558 0.1194
as.factor(OverallQual_group)9 -0.4307 0.6667
as.factor(OverallQual_group)10 14.88 1.162e-46
YearBuilt 12.27 5.277e-33
RichNbrhd:TotalSF 17.13 5.419e-60
RichNbrhd:as.factor(OverallQual_group)6 0.7525 0.4519
RichNbrhd:as.factor(OverallQual_group)7 8.022 2.147e-15
RichNbrhd:as.factor(OverallQual_group)8 10.65 1.46e-25
RichNbrhd:as.factor(OverallQual_group)9 3.089 0.002045
TotalSF:as.factor(OverallQual_group)6 3.657 0.0002647
TotalSF:as.factor(OverallQual_group)7 6.128 1.151e-09
TotalSF:as.factor(OverallQual_group)8 5.828 6.917e-09
TotalSF:as.factor(OverallQual_group)9 1.587 0.1127
TotalSF:as.factor(OverallQual_group)10 -14.08 2.864e-42
RichNbrhd:TotalSF:as.factor(OverallQual_group)6 -0.5661 0.5714
RichNbrhd:TotalSF:as.factor(OverallQual_group)7 -7.726 2.077e-14
RichNbrhd:TotalSF:as.factor(OverallQual_group)8 -12.67 5.661e-35
RichNbrhd:TotalSF:as.factor(OverallQual_group)9 -2.837 0.004615
Fitting linear model: SalePrice ~ RichNbrhd * TotalSF * as.factor(OverallQual_group) + YearBuilt
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
1460 28884 0.8698 0.8678




Model and Valadation

Bellow is the model I choice to use for the purposes of this analyses.

mylm_80 = lm(SalePrice ~ RichNbrhd*TotalSF*as.factor(OverallQual_group) + YearBuilt, data = train) 
pander(summary(mylm_80))
Table continues below
  Estimate Std. Error
(Intercept) -694411 61187
RichNbrhd -568990 47159
TotalSF 33.76 2.194
as.factor(OverallQual_group)6 -13805 8351
as.factor(OverallQual_group)7 -22768 10043
as.factor(OverallQual_group)8 -26973 17308
as.factor(OverallQual_group)9 -48468 112525
as.factor(OverallQual_group)10 418510 28134
YearBuilt 384.2 31.3
RichNbrhd:TotalSF 148.3 8.66
RichNbrhd:as.factor(OverallQual_group)6 733242 974427
RichNbrhd:as.factor(OverallQual_group)7 487985 60834
RichNbrhd:as.factor(OverallQual_group)8 570124 53513
RichNbrhd:as.factor(OverallQual_group)9 397519 128680
TotalSF:as.factor(OverallQual_group)6 12.83 3.509
TotalSF:as.factor(OverallQual_group)7 23.17 3.782
TotalSF:as.factor(OverallQual_group)8 32.17 5.52
TotalSF:as.factor(OverallQual_group)9 50.69 31.94
TotalSF:as.factor(OverallQual_group)10 -64.14 4.555
RichNbrhd:TotalSF:as.factor(OverallQual_group)6 -206.5 364.8
RichNbrhd:TotalSF:as.factor(OverallQual_group)7 -118.2 15.29
RichNbrhd:TotalSF:as.factor(OverallQual_group)8 -143.5 11.32
RichNbrhd:TotalSF:as.factor(OverallQual_group)9 -98.19 34.61
  t value Pr(>|t|)
(Intercept) -11.35 1.195e-28
RichNbrhd -12.07 5.353e-32
TotalSF 15.39 1.436e-49
as.factor(OverallQual_group)6 -1.653 0.09853
as.factor(OverallQual_group)7 -2.267 0.02354
as.factor(OverallQual_group)8 -1.558 0.1194
as.factor(OverallQual_group)9 -0.4307 0.6667
as.factor(OverallQual_group)10 14.88 1.162e-46
YearBuilt 12.27 5.277e-33
RichNbrhd:TotalSF 17.13 5.419e-60
RichNbrhd:as.factor(OverallQual_group)6 0.7525 0.4519
RichNbrhd:as.factor(OverallQual_group)7 8.022 2.147e-15
RichNbrhd:as.factor(OverallQual_group)8 10.65 1.46e-25
RichNbrhd:as.factor(OverallQual_group)9 3.089 0.002045
TotalSF:as.factor(OverallQual_group)6 3.657 0.0002647
TotalSF:as.factor(OverallQual_group)7 6.128 1.151e-09
TotalSF:as.factor(OverallQual_group)8 5.828 6.917e-09
TotalSF:as.factor(OverallQual_group)9 1.587 0.1127
TotalSF:as.factor(OverallQual_group)10 -14.08 2.864e-42
RichNbrhd:TotalSF:as.factor(OverallQual_group)6 -0.5661 0.5714
RichNbrhd:TotalSF:as.factor(OverallQual_group)7 -7.726 2.077e-14
RichNbrhd:TotalSF:as.factor(OverallQual_group)8 -12.67 5.661e-35
RichNbrhd:TotalSF:as.factor(OverallQual_group)9 -2.837 0.004615
Fitting linear model: SalePrice ~ RichNbrhd * TotalSF * as.factor(OverallQual_group) + YearBuilt
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
1460 28884 0.8698 0.8678

Vizulization of my model:

# ggplot(train, aes(y = SalePrice, x =  TotalSF, color = as.factor(RichNbrhd))) +
#   geom_point() +
#   facet_wrap(~OverallQual_group) +
#   geom_line(aes(y = mylm_80$fitted))

# ggplot(train, aes(y = SalePrice, x = TotalSF, 
#                   color = as.factor(RichNbrhd), 
#                   size = YearBuilt)) + 
#   geom_point() + 
#   facet_wrap(~OverallQual_group) + 
#   geom_line(aes(y = mylm_80$fitted), size = 1)  # Ensures line has a fixed size
train <- train %>%
  mutate(YearBuilt_group = ifelse(YearBuilt < 1900, "Pre-1900",
                           ifelse(YearBuilt < 1920, "1900s",
                           ifelse(YearBuilt < 1940, "1920s-30s",
                           ifelse(YearBuilt < 1960, "1940s-50s",
                           ifelse(YearBuilt < 1980, "1960s-70s",
                           ifelse(YearBuilt < 2000, "1980s-90s", 
                                                   "2000s-2010s")))))))
ggplot(train, aes(y = SalePrice, x = TotalSF, 
                  color = as.factor(RichNbrhd))) + 
  geom_point() + 
  facet_wrap(~OverallQual_group + YearBuilt_group) +  # Facet by both OverallQual_group and decade
  geom_line(aes(y = mylm_80$fitted), size = 1)  # Ensures line has a fixed size


Bellow is the validation prosses of my model. Because the \(R^2\) didn’t change drastically we can be more assured that the model is good.

set.seed(121)
num_rows <- 1000 #1460 total
keep <- sample(1:nrow(train), num_rows)

mytrain <- train[keep, ] #Use this in the lm(..., data=mytrain) it is like "rbdata"
mytest <- train[-keep, ] #Use this in the predict(..., newdata=mytest) it is like "rbdata2"


mylm_80 = lm(SalePrice ~ RichNbrhd*TotalSF*as.factor(OverallQual_group) + YearBuilt, data = train) 
# pander(summary(mylm_80))

yh_mylm_80 <- predict(mylm_80, newdata=mytest)
ybar <- mean(mytest$SalePrice)
SSTO <- sum( (mytest$SalePrice - ybar)^2 )
SSE_mylm_80 <- sum((mytest$SalePrice - yh_mylm_80)^2)
rs_mylm_80 <- 1 - SSE_mylm_80/SSTO

n <- nrow(mytest)
p_mylm_80 <- length(mylm_80)
p_mylm_80 <- length(coef(mylm_80)) # Number of estimated coefficients

rsa_mylm_80 <- 1 - ((n - 1) / (n - p_mylm_80)) * SSE_mylm_80 / SSTO
# rsa_mylm_80
# summary(mylm_80)

my_output_table2 <- data.frame(Model = c("My Model"),
`Original R2` = c(summary(mylm_80)$r.squared), 
`Orig. Adj. R-squared` = c(summary(mylm_80)$adj.r.squared), 

`Validation R-squared` = c(rs_mylm_80), 
`Validation Adj. R^2` = c(rsa_mylm_80))

colnames(my_output_table2) <- c("Model", "Original $R^2$", "Original Adj. $R^2$", "Validation $R^2$", "Validation Adj. $R^2$")

knitr::kable(my_output_table2, escape=TRUE, digits=4)
Model Original \(R^2\) Original Adj. \(R^2\) Validation \(R^2\) Validation Adj. \(R^2\)
My Model 0.8698 0.8678 0.8727 0.8657




Interpretation

Below is the visualization of the model, where each line represents the different types of houses that the model takes into account.

For example, if a house has an Overall Quality score of 9, a rich neighborhood rating of 1, and a year built of, say, 2000, then you would look at the blue line for the chart labeled 9, 2000s-2010s. From there, you would find the total square footage of your house on the x-axis and then locate the estimated sales price on the y-axis. By doing all of this, you would have a pretty good idea of what your house might sell for.

There are caveats to this, of course, the biggest one being that the residual standard error is 30,350. This means the model is saying that your predicted model price will probably be around the number that aligns with the line, plus or minus about $28,880.00.

Because TotalSF is the on the x-axis anything the only thing that effects the slop is TotalSF and all the things interacts with it. Everything else will be changing the intercept.

# ggplot(train, aes(y = SalePrice, x = TotalSF, 
#                   color = as.factor(RichNbrhd))) + 
#   geom_point() + 
#   facet_wrap(~OverallQual_group + YearBuilt_group) +  
#   geom_line(aes(y = mylm_80$fitted), size = 1)  

train <- train %>%
  mutate(YearBuilt_group = case_when(
    YearBuilt < 1900 ~ "Pre-1900",
    YearBuilt < 1920 ~ "1900s",
    YearBuilt < 1940 ~ "1920s-30s",
    YearBuilt < 1960 ~ "1940s-50s",
    YearBuilt < 1980 ~ "1960s-70s",
    YearBuilt < 2000 ~ "1980s-90s",
    TRUE ~ "2000s-2010s"
  ),
  OverallQual_group = case_when(
    OverallQual <= 5 ~ 5,
    TRUE ~ OverallQual
  ),
  Combined_Group = case_when(
    (OverallQual == 8 & YearBuilt_group == "1900s") |
    (OverallQual == 9 & YearBuilt_group %in% c("Pre-1900", "1900s", "1920s-30s", "1940s-50s", "1960s-70s", "1980s-90s")) |
    (OverallQual == 8 & YearBuilt_group == "Pre-1900") |
    (OverallQual == 7 & YearBuilt_group == "1900s") |
    (OverallQual == 6 & YearBuilt_group == "Pre-1900") |
    (OverallQual == 10 & YearBuilt_group == "Pre-1900") ~ "Combined",
    TRUE ~ paste(OverallQual_group, YearBuilt_group, sep = "_")
  ))

ggplot(train, aes(y = SalePrice, x = TotalSF, color = as.factor(RichNbrhd))) +
  geom_point() +
  facet_wrap(~ Combined_Group) +
  geom_line(aes(y = mylm_80$fitted), size = 1)

*I had to combine some of the categories because of lack of data.